Predicting voter behavior is complicated for many reasons, despite the tremendous effort that goes into collecting, analyzing, and understanding the many available datasets. For our final project, we will analyze the 2016 presidential election dataset, but first, some background.
The 2012 presidential election did not come as a surprise. Some, most notably Nate Silver, predicted the outcome correctly, and many speculated about his approach.
Despite the success in 2012, the 2016 presidential election came as a big surprise to many; it was a clear example that even the current state of the art can surprise us.
Answer the following questions in one paragraph for each.
What makes voter behavior prediction (and thus election forecasting) a hard problem?
What was unique to Nate Silver’s approach in 2012 that allowed him to achieve good predictions?
What went wrong in 2016? What do you think should be done to make future predictions better?
election.raw = read.csv("data/election/election.csv") %>% as.tbl
census_meta = read.csv("data/census/metadata.csv", sep = ";") %>% as.tbl
census = read.csv("data/census/census.csv") %>% as.tbl
census$CensusTract = as.factor(census$CensusTract)
The following are the first few rows of the election.raw data:
| county | fips | candidate | state | votes |
|---|---|---|---|---|
| NA | US | Donald Trump | US | 62984825 |
| NA | US | Hillary Clinton | US | 65853516 |
| NA | US | Gary Johnson | US | 4489221 |
| NA | US | Jill Stein | US | 1429596 |
| NA | US | Evan McMullin | US | 510002 |
| NA | US | Darrell Castle | US | 186545 |
The meaning of each column in election.raw is clear except fips. The acronym is short for Federal Information Processing Standard.
In our dataset, fips values denote the area (US, state, or county) that each row of data represents; i.e., some rows in election.raw are summary rows. These rows have a county value of NA. There are two kinds of summary rows: federal-level summary rows have a fips value of US, and state-level summary rows have the state abbreviation as their fips value.
The following are the first few rows of the census data:
| CensusTract | State | County | TotalPop | Men | Women | Hispanic | White | Black | Native | Asian | Pacific | Citizen | Income | IncomeErr | IncomePerCap | IncomePerCapErr | Poverty | ChildPoverty | Professional | Service | Office | Construction | Production | Drive | Carpool | Transit | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001020100 | Alabama | Autauga | 1948 | 940 | 1008 | 0.9 | 87.4 | 7.7 | 0.3 | 0.6 | 0.0 | 1503 | 61838 | 11900 | 25713 | 4548 | 8.1 | 8.4 | 34.7 | 17.0 | 21.3 | 11.9 | 15.2 | 90.2 | 4.8 | 0 | 0.5 | 2.3 | 2.1 | 25.0 | 943 | 77.1 | 18.3 | 4.6 | 0 | 5.4 |
| 1001020200 | Alabama | Autauga | 2156 | 1059 | 1097 | 0.8 | 40.4 | 53.3 | 0.0 | 2.3 | 0.0 | 1662 | 32303 | 13538 | 18021 | 2474 | 25.5 | 40.3 | 22.3 | 24.7 | 21.5 | 9.4 | 22.0 | 86.3 | 13.1 | 0 | 0.0 | 0.7 | 0.0 | 23.4 | 753 | 77.0 | 16.9 | 6.1 | 0 | 13.3 |
| 1001020300 | Alabama | Autauga | 2968 | 1364 | 1604 | 0.0 | 74.5 | 18.6 | 0.5 | 1.4 | 0.3 | 2335 | 44922 | 5629 | 20689 | 2817 | 12.7 | 19.7 | 31.4 | 24.9 | 22.1 | 9.2 | 12.4 | 94.8 | 2.8 | 0 | 0.0 | 0.0 | 2.5 | 19.6 | 1373 | 64.1 | 23.6 | 12.3 | 0 | 6.2 |
| 1001020400 | Alabama | Autauga | 4423 | 2172 | 2251 | 10.5 | 82.8 | 3.7 | 1.6 | 0.0 | 0.0 | 3306 | 54329 | 7003 | 24125 | 2870 | 2.1 | 1.6 | 27.0 | 20.8 | 27.0 | 8.7 | 16.4 | 86.6 | 9.1 | 0 | 0.0 | 2.6 | 1.6 | 25.3 | 1782 | 75.7 | 21.2 | 3.1 | 0 | 10.8 |
| 1001020500 | Alabama | Autauga | 10763 | 4922 | 5841 | 0.7 | 68.5 | 24.8 | 0.0 | 3.8 | 0.0 | 7666 | 51965 | 6935 | 27526 | 2813 | 11.4 | 17.5 | 49.6 | 14.2 | 18.2 | 2.1 | 15.8 | 88.0 | 10.5 | 0 | 0.0 | 0.6 | 0.9 | 24.8 | 5037 | 67.1 | 27.6 | 5.3 | 0 | 4.2 |
| 1001020600 | Alabama | Autauga | 3851 | 1787 | 2064 | 13.1 | 72.9 | 11.9 | 0.0 | 0.0 | 0.0 | 2642 | 63092 | 9585 | 30480 | 7550 | 14.4 | 21.9 | 24.2 | 17.5 | 35.4 | 7.9 | 14.9 | 82.7 | 6.9 | 0 | 0.0 | 6.0 | 4.5 | 19.8 | 1560 | 79.4 | 14.7 | 5.8 | 0 | 10.9 |
Column information is given in metadata.
| CensusTract | Census.tract.ID | numeric |
|---|---|---|
| State | State, DC, or Puerto Rico | string |
| County | County or county equivalent | string |
| TotalPop | Total population | numeric |
| Men | Number of men | numeric |
| Women | Number of women | numeric |
| Hispanic | % of population that is Hispanic/Latino | numeric |
| White | % of population that is white | numeric |
| Black | % of population that is black | numeric |
| Native | % of population that is Native American or Native Alaskan | numeric |
| Asian | % of population that is Asian | numeric |
| Pacific | % of population that is Native Hawaiian or Pacific Islander | numeric |
| Citizen | Number of citizens | numeric |
| Income | Median household income ($) | numeric |
| IncomeErr | Median household income error ($) | numeric |
| IncomePerCap | Income per capita ($) | numeric |
| IncomePerCapErr | Income per capita error ($) | numeric |
| Poverty | % under poverty level | numeric |
| ChildPoverty | % of children under poverty level | numeric |
| Professional | % employed in management, business, science, and arts | numeric |
| Service | % employed in service jobs | numeric |
| Office | % employed in sales and office jobs | numeric |
| Construction | % employed in natural resources, construction, and maintenance | numeric |
| Production | % employed in production, transportation, and material movement | numeric |
| Drive | % commuting alone in a car, van, or truck | numeric |
| Carpool | % carpooling in a car, van, or truck | numeric |
| Transit | % commuting on public transportation | numeric |
| Walk | % walking to work | numeric |
| OtherTransp | % commuting via other means | numeric |
| WorkAtHome | % working at home | numeric |
| MeanCommute | Mean commute time (minutes) | numeric |
| Employed | % employed (16+) | numeric |
| PrivateWork | % employed in private industry | numeric |
| PublicWork | % employed in public jobs | numeric |
| SelfEmployed | % self-employed | numeric |
| FamilyWork | % in unpaid family work | numeric |
| Unemployment | % unemployed | numeric |
Remove summary rows from election.raw data; i.e.,
Federal-level summary rows go into election_federal.
State-level summary rows go into election_state.
Only county-level rows remain in election.
# Some data in election.raw is not encoded correctly or is missing information.
# Specifically, FIPS 46102 of South Dakota has NA under county, so we need to
# add it to the county-level data. FIPS 2000 of Alaska also has NA under
# county. Oddly enough, the votes for FIPS 2000 and FIPS AK are exactly the
# same, and there is no other county-level data for Alaska, so the election
# data appears to group all of Alaska's counties into one. I keep this row in
# the county-level data anyway, since there is no other county-level data on
# Alaska.
election_federal = election.raw %>% filter(fips=="US")
election_state = election.raw %>% filter(fips!="US", fips==as.character(state))
election = election.raw %>% filter(!is.na(county) | fips=="46102" | fips=="2000")
How many named presidential candidates were there in the 2016 election? Draw a bar chart of all votes received by each candidate.
nrow(election_federal)
## [1] 32
# There were 32 presidential candidates in the 2016 election.
ggplot(data=election_federal, aes(candidate, votes)) +
geom_bar(stat="identity", aes(fill=candidate)) +
ggtitle("Votes Received by Each Candidate") +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
Create variables county_winner and state_winner by taking the candidate with the highest proportion of votes. Hint: to create county_winner, start with election, group by fips, compute total votes, and pct = votes/total. Then choose the highest row using top_n (variable state_winner is similar).
election.percentages = election %>% group_by(fips) %>%
mutate(total=sum(votes), percent = votes/total)
county_winner = top_n(election.percentages, 1)
## Selecting by percent
election_state.percentages = election_state %>% group_by(fips) %>%
mutate(total=sum(votes), percent = votes/total)
state_winner = top_n(election_state.percentages, 1)
## Selecting by percent
Visualization is crucial for gaining insight and intuition during data mining. We will overlay our data on maps.
The R package ggplot2 can be used to draw maps. Consider the following code.
states = map_data("state")
ggplot(data = states) +
geom_polygon(aes(x = long, y = lat, fill = region, group = group), color = "white") +
coord_fixed(1.3) +
guides(fill=FALSE) # color legend is unnecessary and takes too long
The variable states contains the information needed to draw the white-bordered polygons, and the fill colors are determined by region.
Draw a county-level map by creating counties = map_data("county"). Color by county.
counties = map_data("county")
ggplot(data = counties) +
geom_polygon(aes(x = long, y = lat, fill = subregion, group = group), color = "white") +
coord_fixed(1.3) +
guides(fill=FALSE) # color legend is unnecessary and takes too long
Now color the map by the winning candidate for each state. First, combine the states variable with the state_winner we created earlier using left_join(). Note that left_join() needs matching values to join the tables; however, they are in different formats: e.g. AZ vs. arizona. Before using left_join(), create a common column in states named fips = state.abb[match(some_column, some_function(state.name))]. Replace some_column and some_function to complete the creation of this new column. Then left_join(). Your figure will look similar to the state-level New York Times map.
states = states %>% mutate(fips = state.abb[match(region, tolower(state.name))])
states = left_join(states, state_winner)
## Joining, by = "fips"
## Warning: Column `fips` joining character vector and factor, coercing into
## character vector
ggplot(data = states) +
geom_polygon(aes(x = long, y = lat, fill = candidate, group = group), color = "white") +
coord_fixed(1.3)
The counties variable does not have a fips column, so we will create one by pooling information from maps::county.fips. Split the polyname column into region and subregion. Use left_join() to combine county.fips with counties. Also left_join() the previously created variable county_winner. Your figure will look similar to the county-level New York Times map.
my.county.fips = county.fips %>% separate(polyname, c("region", "subregion"), ",")
my.county.fips = my.county.fips %>% mutate(fips=as.factor(fips))
my.county.fips = left_join(my.county.fips, counties)
## Joining, by = c("region", "subregion")
my.county.fips = left_join(my.county.fips, county_winner)
## Joining, by = "fips"
## Warning: Column `fips` joining factors with different levels, coercing to
## character vector
ggplot(data = my.county.fips) +
geom_polygon(aes(x = long, y = lat, fill = candidate, group = group), color = "white") +
coord_fixed(1.3)
Create a visualization of your choice using census data. Many exit polls noted that demographics played a big role in the election. Use this Washington Post article and this R graph gallery for ideas and inspiration.
county_population = census %>%
mutate(region = tolower(State),
subregion = tolower(County)) %>%
group_by(region, subregion) %>%
summarise_at(vars(TotalPop),funs(sum(.)))
county_population = left_join(counties, county_population)
## Joining, by = c("region", "subregion")
ggplot(data = county_population) +
geom_polygon(aes(x = long, y = lat, fill = TotalPop, group = group), color = "white", size=0.05) +
coord_fixed(1.3) +
scale_fill_gradient(trans = "log10") +
  ggtitle("Total Population by County (log scale)")
The census data contains high-resolution information (more fine-grained than county level).
In this problem, we aggregate the information into county-level data by computing the TotalPop-weighted average of each attribute for each county. Create the following variables:
Clean census data census.del: start with census, filter out any rows with missing values, convert the {Men, Employed, Citizen} attributes to percentages (the metadata seems to be inaccurate), compute a Minority attribute by combining {Hispanic, Black, Native, Asian, Pacific}, and remove {Walk, PublicWork, Construction}.
Many columns are related; when a set of columns adds up to 100%, one of them can be deleted.
Sub-county census data, census.subct: start with census.del from above, group_by() two attributes {State, County}, use add_tally() to compute CountyTotal. Also, compute the weight by TotalPop/CountyTotal.
County census data, census.ct: start with census.subct and use summarise_all() to compute the TotalPop-weighted mean of each attribute.
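The county-level aggregation that follows relies on weighted.mean, i.e. sum(w*x)/sum(w); a one-line sanity check:

```r
x = c(10, 20)
w = c(1, 3)            # weights need not sum to 1; weighted.mean() normalizes them
weighted.mean(x, w)    # (1*10 + 3*20) / (1 + 3) = 17.5
```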
Print a few rows of census.ct:
census.del = na.omit(census) %>%
mutate(Men=Men/TotalPop,
Employed=Employed/TotalPop,
Citizen=Citizen/TotalPop,
Minority=Hispanic+Black+Native+Asian+Pacific) %>%
select(-Hispanic, -Black, -Native, -Asian, -Pacific,
-Walk, -PublicWork, -Construction, -Women)
census.subct =
  census.del %>%
  group_by(State, County) %>%
  add_tally(wt=TotalPop) %>%
  rename(CountyTotal=n) %>%  # add_tally() names its count column "n"
  mutate(Weight=TotalPop/CountyTotal)
census.ct = census.subct %>%
select(-CensusTract) %>%
summarise_all(funs(weighted.mean(., w=Weight)))
kable(census.ct %>% head)
| State | County | TotalPop | Men | White | Citizen | Income | IncomeErr | IncomePerCap | IncomePerCapErr | Poverty | ChildPoverty | Professional | Service | Office | Production | Drive | Carpool | Transit | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | SelfEmployed | FamilyWork | Unemployment | Minority | CountyTotal | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama | Autauga | 6486.404 | 0.4843266 | 75.78823 | 0.7374912 | 51696.29 | 7771.009 | 24974.50 | 3433.674 | 12.91231 | 18.70758 | 32.79097 | 17.17044 | 24.28243 | 17.15713 | 87.50624 | 8.781235 | 0.0952590 | 1.3059687 | 1.8356531 | 26.50016 | 0.4343637 | 73.73649 | 5.433254 | 0.0000000 | 7.733726 | 22.53687 | 55221 | 0.1174626 |
| Alabama | Baldwin | 7698.957 | 0.4884866 | 83.10262 | 0.7569406 | 51074.36 | 8745.050 | 27316.84 | 3803.718 | 13.42423 | 19.48431 | 32.72994 | 17.95092 | 27.10439 | 11.32186 | 84.59861 | 8.959078 | 0.1266209 | 1.4438000 | 3.8504774 | 26.32218 | 0.4405113 | 81.28266 | 5.909353 | 0.3633269 | 7.589820 | 15.21426 | 195121 | 0.0394573 |
| Alabama | Barbour | 3325.195 | 0.5382816 | 46.23159 | 0.7691222 | 32959.30 | 6031.065 | 16824.22 | 2430.189 | 26.50563 | 43.55962 | 26.12404 | 16.46343 | 23.27878 | 23.31741 | 83.33021 | 11.056609 | 0.4954032 | 1.6217251 | 1.5019456 | 24.51828 | 0.3192113 | 71.59426 | 7.149837 | 0.0897742 | 17.525557 | 51.94382 | 26932 | 0.1234663 |
| Alabama | Bibb | 6380.718 | 0.5341090 | 74.49989 | 0.7739781 | 38886.63 | 5662.358 | 18430.99 | 3073.599 | 16.60375 | 27.19708 | 21.59010 | 17.95545 | 17.46731 | 23.74415 | 83.43488 | 13.153641 | 0.5031366 | 1.5620952 | 0.7314679 | 28.71439 | 0.3669262 | 76.74385 | 6.637936 | 0.3941515 | 8.163104 | 24.16597 | 22604 | 0.2822827 |
| Alabama | Blount | 7018.573 | 0.4940565 | 87.85385 | 0.7337550 | 46237.97 | 8695.786 | 20532.27 | 2052.055 | 16.72152 | 26.85738 | 28.52930 | 13.94252 | 23.83692 | 20.10413 | 84.85031 | 11.279222 | 0.3626321 | 0.4199411 | 2.2654133 | 34.84489 | 0.3844914 | 81.82671 | 4.228716 | 0.3564928 | 7.699640 | 10.59474 | 57710 | 0.1216180 |
| Alabama | Bullock | 4263.211 | 0.5300618 | 22.19918 | 0.7545420 | 33292.69 | 9000.345 | 17579.57 | 3110.645 | 24.50260 | 37.29116 | 19.55253 | 14.92420 | 20.17051 | 25.73547 | 74.77277 | 14.839127 | 0.7732160 | 1.8238247 | 3.0998783 | 28.63106 | 0.3619592 | 79.09065 | 5.273684 | 0.0000000 | 17.890026 | 76.53587 | 10678 | 0.3992518 |
Run PCA for both county & sub-county level data. Save the principal components data frames, call them ct.pc and subct.pc, respectively. What are the most prominent loadings of the first two principal components PC1 and PC2?
census.ct = census.ct %>% ungroup
census.subct = census.subct %>% ungroup
tmp.ct = census.ct %>%
select(-State, -County, -CountyTotal, -Weight)
tmp.subct = census.subct %>%
select(-CensusTract, -State, -County, -CountyTotal, -Weight)
# We should scale because variances of the different features will be
# vastly different.
ct.pc = prcomp(tmp.ct,
center=TRUE,
scale=TRUE)
subct.pc = prcomp(tmp.subct,
center=TRUE,
scale=TRUE)
biplot(ct.pc)
biplot(subct.pc)
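The prominent loadings can also be ranked programmatically from the rotation matrix rather than read off a crowded biplot. A toy sketch of the idea; on the document's objects, the analogous call would be sort(abs(ct.pc$rotation[, 1]), decreasing = TRUE):

```r
set.seed(1)
X = matrix(rnorm(100), nrow = 20)       # toy data: 20 observations, 5 features
colnames(X) = paste0("feat", 1:5)
pc = prcomp(X, center = TRUE, scale. = TRUE)

# Features ranked by absolute loading on PC1, largest first
top1 = sort(abs(pc$rotation[, "PC1"]), decreasing = TRUE)
head(top1, 3)
```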
# At a glance at the biplots, the Income and Minority loadings (among others)
# have a relatively large magnitude on PC1 and PC2.
With census.ct, perform hierarchical clustering using the Euclidean distance metric and complete linkage to find 10 clusters. Repeat the clustering with the first 5 principal components of ct.pc. Compare and contrast the clusters containing San Mateo County. Can you hypothesize why this would be the case?
dist.ct = dist(ct.pc$x)
hc.ct = hclust(dist.ct, method="complete")
ct.pc5 = ct.pc$x[,1:5]
dist.ct5 = dist(ct.pc5)
hc.ct5 = hclust(dist.ct5, method="complete")
col_vector = c("#000000", "#FF0000", "#608E46", "#F3DD00", "#87FF00",
"#00FFD1", "#0093FF", "#0013FF", "#C500FF", "#FF008B")
# Plot all PCs
cols_t1<-col_vector[cutree(hc.ct, k=10)]
plot(ct.pc$x, col=cols_t1)
legend("topright", legend=paste("Cluster", 1:10),
       col=col_vector, pch=1, cex=0.8)
# Plot first 5 PCs
cols_t2<-col_vector[cutree(hc.ct5, k=10)]
plot(ct.pc5, col=cols_t2)
legend("topright", legend=paste("Cluster", 1:10),
       col=col_vector, pch=1, cex=0.8)
summary(ct.pc)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.5836 1.9248 1.8166 1.30618 1.09784 1.07422
## Proportion of Variance 0.2567 0.1425 0.1269 0.06562 0.04636 0.04438
## Cumulative Proportion 0.2567 0.3992 0.5262 0.59177 0.63812 0.68251
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 1.07053 0.99209 0.93460 0.88405 0.86656 0.78071
## Proportion of Variance 0.04408 0.03786 0.03359 0.03006 0.02888 0.02344
## Cumulative Proportion 0.72658 0.76444 0.79804 0.82810 0.85698 0.88042
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.70711 0.69387 0.62992 0.59578 0.56022 0.54534
## Proportion of Variance 0.01923 0.01852 0.01526 0.01365 0.01207 0.01144
## Cumulative Proportion 0.89965 0.91817 0.93343 0.94708 0.95915 0.97059
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.4646 0.42231 0.3678 0.31065 0.24869 0.21151
## Proportion of Variance 0.0083 0.00686 0.0052 0.00371 0.00238 0.00172
## Cumulative Proportion 0.9789 0.98575 0.9910 0.99467 0.99705 0.99877
## PC25 PC26
## Standard deviation 0.17221 0.04921
## Proportion of Variance 0.00114 0.00009
## Cumulative Proportion 0.99991 1.00000
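A specific county's cluster assignment can be looked up by indexing into cutree's output. A toy sketch of the indexing; on the document's objects the analogous lookup would be cutree(hc.ct, k = 10)[which(census.ct$County == "San Mateo")], assuming census.ct's row order matches the clustering input:

```r
df = data.frame(name = c("A", "B", "C", "San Mateo", "E"),
                x = c(0, 0.1, 5, 5.1, 10),
                y = c(0, 0.2, 5, 4.9, 10))
hc = hclust(dist(df[, c("x", "y")]), method = "complete")
cl = cutree(hc, k = 3)                 # one cluster label per row of df
cl[which(df$name == "San Mateo")]      # cluster containing the row of interest
```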
# Because the 5-PC clustering uses much less information, San Mateo County may
# end up in a different cluster than when all PCs are used. The summary shows
# that the first 5 PCs account for only about 64% of the variance; a lot of
# information was lost, so finding San Mateo (or any other county) in a
# different cluster would not be surprising.
In order to train classification models, we need to combine the county_winner and census.ct data. This seemingly straightforward task is harder than it sounds. The following code makes the necessary changes to merge them into election.cl for classification.
tmpwinner = county_winner %>% ungroup %>%
  mutate(state = state.name[match(state, state.abb)]) %>% ## abbreviations to full state names
mutate_at(vars(state, county), tolower) %>% ## to all lowercase
mutate(county = gsub(" county| columbia| city| parish", "", county)) ## remove suffixes
tmpcensus = census.ct %>% mutate_at(vars(State, County), tolower)
election.cl = tmpwinner %>%
left_join(tmpcensus, by = c("state"="State", "county"="County")) %>%
na.omit
## saves meta information to attributes
attr(election.cl, "location") = election.cl %>% select(c(county, fips, state, votes, percent))
election.cl = election.cl %>% select(-c(county, fips, state, votes, percent, total, CountyTotal, Weight))
Using the following code, partition data into 80% training and 20% testing:
set.seed(10)
n = nrow(election.cl)
in.trn= sample.int(n, 0.8*n)
trn.cl = election.cl[ in.trn,]
tst.cl = election.cl[-in.trn,]
Using the following code, define 10 cross-validation folds:
set.seed(20)
nfold = 10
folds = sample(cut(1:nrow(trn.cl), breaks=nfold, labels=FALSE))
Using the following error rate function:
calc_error_rate = function(predicted.value, true.value){
return(mean(true.value!=predicted.value))
}
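calc_error_rate is just the mean mismatch rate between predicted and true labels; a quick self-contained check:

```r
# The document's error-rate helper, reproduced for a self-contained check
calc_error_rate = function(predicted.value, true.value){
  return(mean(true.value != predicted.value))
}
calc_error_rate(c("a", "b", "b", "a"), c("a", "b", "a", "a"))   # 1 mismatch in 4 = 0.25
```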
records = matrix(NA, nrow=3, ncol=2)
colnames(records) = c("train.error","test.error")
rownames(records) = c("tree","knn","glm")
Decision tree: train a decision tree by cv.tree(). Prune tree to minimize misclassification. Be sure to use the folds from above for cross-validation. Visualize the trees before and after pruning. Save training and test errors to records variable.
election_tree = tree(candidate~.,
data = trn.cl,
control = tree.control(nobs = nrow(trn.cl), minsize = 6, mindev = 1e-6))
draw.tree(election_tree, cex=0.6, nodeinfo=TRUE)
cv = cv.tree(election_tree, rand=folds, FUN=prune.misclass, K=nfold)
best.size.cv = min(cv$size[which(cv$dev == min(cv$dev))])
election_tree.pruned = prune.tree(election_tree, best = best.size.cv)
draw.tree(election_tree.pruned, cex = 0.6, nodeinfo=TRUE)
train.pred = predict(election_tree.pruned, trn.cl[,-1], type = 'class')
test.pred = predict(election_tree.pruned, tst.cl[,-1], type = 'class')
error.training.dt <- calc_error_rate(train.pred, trn.cl$candidate)
error.testing.dt <- calc_error_rate(test.pred, tst.cl$candidate)
records[1,1] <- error.training.dt
records[1,2] <- error.testing.dt
records
## train.error test.error
## tree 0.07084691 0.08469055
## knn NA NA
## glm NA NA
K-nearest neighbor: train a KNN model for classification. Use cross-validation to determine the best number of neighbors, and plot the number of neighbors vs. the resulting training and validation errors. Compute the test error and save it to records.
# do.chunk() for k-fold Cross-validation
do.chunk <- function(chunkid, folddef, Xdat, Ydat, k){ # Function arguments
train = (folddef!=chunkid) # Get training index
Xtr = Xdat[train,] # Get training set by the above index
Ytr = Ydat[train] # Get true labels in training set
Xvl = Xdat[!train,] # Get validation set
Yvl = Ydat[!train] # Get true labels in validation set
predYtr = knn(train=Xtr, test=Xtr, cl=Ytr, k = k) # Predict training labels
predYvl = knn(train=Xtr, test=Xvl, cl=Ytr, k = k) # Predict validation labels
data.frame(train.error = mean(predYtr != Ytr), # Training error for each fold
val.error = mean(predYvl != Yvl)) # Validation error for each fold
}
# Set error.folds (a vector) to save validation errors in future
error.folds = NULL
# Give possible number of nearest neighbours to be considered
allK = c(1, seq(10,50,length.out=9))
# Set seed since do.chunk() contains a random component induced by knn()
set.seed(888)
# Loop through different number of neighbors
for (j in allK){
tmp = plyr::ldply(1:nfold, do.chunk, # Apply do.chunk() function to each fold
folddef=folds, Xdat=trn.cl[,-1], Ydat=trn.cl$candidate, k=j)
# Necessary arguments to be passed into do.chunk
tmp$neighbors = j # Keep track of each value of neighbors
error.folds = rbind(error.folds, tmp) # combine results
}
# Transform the format of error.folds for further convenience
errors = reshape2::melt(error.folds, id.vars=c('neighbors'), value.name='error')
# Choose the number of neighbors which minimizes validation error
val.error.means = errors %>%
# Select all rows of validation errors
filter(variable=='val.error') %>%
# Group the selected data frame by neighbors
group_by(neighbors, variable) %>%
# Calculate CV error rate for each k
summarise_all(funs(mean)) %>%
# Remove existing group
ungroup() %>%
filter(error==min(error))
# Best number of neighbors
# if there is a tie, pick larger number of neighbors for simpler model
numneighbor = max(val.error.means$neighbors)
numneighbor
## [1] 20
train.pred <- knn(train = trn.cl[,-1], test = trn.cl[,-1], cl = trn.cl$candidate, k = numneighbor)
test.pred <- knn(train = trn.cl[,-1], test = tst.cl[,-1], cl = trn.cl$candidate, k = numneighbor)
error.training.dt <- calc_error_rate(train.pred, trn.cl$candidate)
error.testing.dt <- calc_error_rate(test.pred, tst.cl$candidate)
records[2,1] <- error.training.dt
records[2,2] <- error.testing.dt
records
## train.error test.error
## tree 0.07084691 0.08469055
## knn 0.13070033 0.15146580
## glm NA NA
Instead of using the native attributes, we can train our classification models on principal components. After this section, we will compare model performance using native attributes versus principal components.
pca.records = matrix(NA, nrow=3, ncol=2)
colnames(pca.records) = c("train.error","test.error")
rownames(pca.records) = c("tree","knn","glm")
Compute principal components from the independent variables in the training data. Then determine the minimum number of PCs needed to capture 90% of the variance. Plot the proportion of variance explained.
# Scale since variances of each feature are of vastly different magnitudes
election.pca = prcomp(trn.cl[,-1], scale=TRUE)
summary(election.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.509 1.9833 1.8322 1.2490 1.14904 1.1007 1.05190
## Proportion of Variance 0.242 0.1513 0.1291 0.0600 0.05078 0.0466 0.04256
## Cumulative Proportion 0.242 0.3933 0.5224 0.5824 0.63322 0.6798 0.72237
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.98043 0.92586 0.91111 0.8624 0.78449 0.72848
## Proportion of Variance 0.03697 0.03297 0.03193 0.0286 0.02367 0.02041
## Cumulative Proportion 0.75934 0.79231 0.82424 0.8528 0.87652 0.89693
## PC14 PC15 PC16 PC17 PC18 PC19
## Standard deviation 0.69622 0.64011 0.61214 0.59291 0.54889 0.46333
## Proportion of Variance 0.01864 0.01576 0.01441 0.01352 0.01159 0.00826
## Cumulative Proportion 0.91557 0.93133 0.94574 0.95926 0.97085 0.97911
## PC20 PC21 PC22 PC23 PC24 PC25
## Standard deviation 0.42022 0.38365 0.3101 0.23525 0.18595 0.17459
## Proportion of Variance 0.00679 0.00566 0.0037 0.00213 0.00133 0.00117
## Cumulative Proportion 0.98590 0.99156 0.9953 0.99739 0.99872 0.99989
## PC26
## Standard deviation 0.05359
## Proportion of Variance 0.00011
## Cumulative Proportion 1.00000
# We can see that it takes 14 PCs to explain 90% of the variance
election.pca.out = summary(election.pca)
election.pca.pve = election.pca.out$importance[2,]
plot(election.pca.pve, xlab="Principal Component",
ylab="Variance Explained", main="Proportion of Variance Explained",
ylim=c(0,1), type='b')
election.pca.cve = election.pca.out$importance[3,]
plot(election.pca.cve, xlab="Principal Component",
ylab="Variance Explained", main="Cumulative Proportion of Variance Explained",
ylim=c(0,1), type='b')
Create a new training dataset by taking the class labels and the principal components; call this variable tr.pca. Create the test dataset based on the principal component loadings: i.e., transform the independent variables in the test data into principal-components space; call this variable test.pca.
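One way to transform test observations into the training principal-component space is predict() on a prcomp object, which applies the training centering, scaling, and rotation to new data; a toy sketch:

```r
set.seed(2)
train = matrix(rnorm(60), nrow = 15)   # toy training data: 15 x 4
test  = matrix(rnorm(20), nrow = 5)    # toy test data: 5 x 4
colnames(train) = colnames(test) = paste0("v", 1:4)

pc = prcomp(train, center = TRUE, scale. = TRUE)
test.scores = predict(pc, newdata = test)   # test rows in the training PC basis
dim(test.scores)                            # 5 x 4
```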
tr.pca = data.frame(trn.cl$candidate, election.pca$x[,1:14])
colnames(tr.pca)[1]="candidate"
# Project the test data into the training PC space; predict() applies the
# training centering, scaling, and rotation (refitting prcomp on the test
# set would give an inconsistent basis)
test.pca = data.frame(tst.cl$candidate, predict(election.pca, tst.cl[,-1])[,1:14])
colnames(test.pca)[1]="candidate"
Decision tree: repeat the training of decision tree models using principal components as independent variables. Record the resulting errors.
pca_tree = tree(candidate~.,
data = tr.pca,
control = tree.control(nobs = nrow(trn.cl), minsize = 6, mindev = 1e-6))
draw.tree(pca_tree, cex=0.6, nodeinfo=TRUE)
cv = cv.tree(pca_tree, rand=folds, FUN=prune.misclass, K=nfold)
pca.size.cv = min(cv$size[which(cv$dev == min(cv$dev))])
pca_tree.pruned = prune.tree(pca_tree, best = pca.size.cv)
draw.tree(pca_tree.pruned, cex = 0.6, nodeinfo=TRUE)
train.pred = predict(pca_tree.pruned, tr.pca[,-1], type = 'class')
test.pred = predict(pca_tree.pruned, test.pca[,-1], type = 'class')
error.training.dt <- calc_error_rate(train.pred, tr.pca$candidate)
error.testing.dt <- calc_error_rate(test.pred, test.pca$candidate)
pca.records[1,1] <- error.training.dt
pca.records[1,2] <- error.testing.dt
pca.records
## train.error test.error
## tree 0.04723127 0.2247557
## knn NA NA
## glm NA NA
K-nearest neighbor: repeat the training of the KNN classifier using principal components as independent variables. Record the resulting errors.
# Set error.folds (a vector) to save validation errors in future
pca.error.folds = NULL
# Set seed since do.chunk() contains a random component induced by knn()
set.seed(888)
# Loop through different number of neighbors
for (j in allK){
tmp = plyr::ldply(1:nfold, do.chunk, # Apply do.chunk() function to each fold
folddef=folds, Xdat=tr.pca[,-1], Ydat=tr.pca$candidate, k=j)
# Necessary arguments to be passed into do.chunk
tmp$neighbors = j # Keep track of each value of neighbors
pca.error.folds = rbind(pca.error.folds, tmp) # combine results
}
# Transform the format of pca.error.folds for further convenience
errors = reshape2::melt(pca.error.folds, id.vars=c('neighbors'), value.name='error')
# Choose the number of neighbors which minimizes validation error
val.error.means = errors %>%
# Select all rows of validation errors
filter(variable=='val.error') %>%
# Group the selected data frame by neighbors
group_by(neighbors, variable) %>%
# Calculate CV error rate for each k
summarise_all(funs(mean)) %>%
# Remove existing group
ungroup() %>%
filter(error==min(error))
# Best number of neighbors
# if there is a tie, pick larger number of neighbors for simpler model
numneighbor = max(val.error.means$neighbors)
numneighbor
## [1] 20
train.pred <- knn(train = tr.pca[,-1], test = tr.pca[,-1], cl = tr.pca$candidate, k = numneighbor)
test.pred <- knn(train = tr.pca[,-1], test = test.pca[,-1], cl = tr.pca$candidate, k = numneighbor)
error.training.dt <- calc_error_rate(train.pred, tr.pca$candidate)
error.testing.dt <- calc_error_rate(test.pred, test.pca$candidate)
pca.records[2,1] <- error.training.dt
pca.records[2,2] <- error.testing.dt
pca.records
## train.error test.error
## tree 0.04723127 0.2247557
## knn 0.06840391 0.2231270
## glm NA NA
This is an open question. Interpret and discuss any insights gained and possible explanations. Use any tools at your disposal to make your case: visualize errors on the map, discuss what does or doesn't seem reasonable based on your understanding of these methods, and propose possible directions (collecting additional data, domain knowledge, etc.).
Propose and tackle at least one interesting question. Be creative! Some possibilities are:
Data preprocessing: we aggregated sub-county level data before performing classification. Would classification at the sub-county level before determining the winner perform better? What implicit assumptions are we making?
Feature engineering: would a non-linear classification method perform better? Would you use native features or principal components?
Additional classification methods: logistic regression, LDA, QDA, SVM, random forest, etc. (You may use methods beyond this course). How do these compare to KNN and tree method?
Bootstrap: perform bootstrap to generate plots similar to Figures 4.10/4.11. Discuss the results.
# We can try glm, but first we need a binary-class setting: we currently have
# a multiclass problem. At the county level, only Trump and Clinton won
# counties, but the candidate factor still carries all of the other levels.
candidate = droplevels(election.cl)$candidate
election.cl.new = data.frame(candidate, election.cl %>% select(-candidate))
set.seed(10)
n = nrow(election.cl.new)
in.trn= sample.int(n, 0.8*n)
trn.cl.new = election.cl.new[ in.trn,]
tst.cl.new = election.cl.new[-in.trn,]
election.glm <- glm(candidate~., family = binomial('logit'), data = trn.cl.new)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
pred.glm.train <- predict(election.glm, trn.cl.new, type = 'response')
pred.glm.test <- predict(election.glm, tst.cl.new, type = 'response')
cl.glm.train <- ifelse(pred.glm.train > 0.5, 'Hillary Clinton', 'Donald Trump')
cl.glm.test <- ifelse(pred.glm.test > 0.5, 'Hillary Clinton', 'Donald Trump')
error.glm.train <- calc_error_rate(cl.glm.train, trn.cl.new$candidate)
error.glm.test <- calc_error_rate(cl.glm.test, tst.cl.new$candidate)
records[3,1] <- error.glm.train
records[3,2] <- error.glm.test
records
## train.error test.error
## tree 0.07084691 0.08469055
## knn 0.13070033 0.15146580
## glm 0.06473941 0.07491857
# Compared to Decision Tree and KNN, Logistic Regression does the best.
# With KNN doing worst, our models seem to suggest that the true decision
# boundaries are closer to linear than non linear.
election.pca.new = prcomp(trn.cl.new[,-1], scale=TRUE)
tr.pca.new = data.frame(trn.cl.new$candidate, election.pca.new$x[,1:14])
colnames(tr.pca.new)[1]="candidate"
# Project the test data using the training loadings rather than refitting PCA
test.pca.new = data.frame(tst.cl.new$candidate, predict(election.pca.new, tst.cl.new[,-1])[,1:14])
colnames(test.pca.new)[1]="candidate"
election.glm <- glm(candidate~., family = binomial('logit'), data = tr.pca.new)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
pred.glm.train <- predict(election.glm, tr.pca.new, type = 'response')
pred.glm.test <- predict(election.glm, test.pca.new, type = 'response')
cl.glm.train <- ifelse(pred.glm.train > 0.5, 'Hillary Clinton', 'Donald Trump')
cl.glm.test <- ifelse(pred.glm.test > 0.5, 'Hillary Clinton', 'Donald Trump')
error.glm.train <- calc_error_rate(cl.glm.train, tr.pca.new$candidate)
error.glm.test <- calc_error_rate(cl.glm.test, test.pca.new$candidate)
pca.records[3,1] <- error.glm.train
pca.records[3,2] <- error.glm.test
pca.records
## train.error test.error
## tree 0.04723127 0.2247557
## knn 0.06840391 0.2231270
## glm 0.08021173 0.3387622
# Under PCA, logistic regression performs the worst of the three. It is not
# surprising that logistic regression on the native features beats the PCA
# version, but it is surprising that it went from the best classifier to the
# worst. This suggests that logistic regression needs more of the original
# information (beyond 14 PCs) to build a good model.
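As one of the additional classification methods listed earlier, LDA is quick to try; a toy sketch using MASS::lda (the analogous call on the document's data would be something like lda(candidate ~ ., data = trn.cl.new), offered as a suggestion rather than as part of the original analysis):

```r
library(MASS)

set.seed(3)
n  = 100
x1 = c(rnorm(n/2, mean = 0), rnorm(n/2, mean = 3))   # two well-separated toy classes
x2 = c(rnorm(n/2, mean = 0), rnorm(n/2, mean = 3))
y  = factor(rep(c("Donald Trump", "Hillary Clinton"), each = n/2))

fit  = lda(y ~ x1 + x2)
pred = predict(fit)$class
mean(pred != y)          # training error rate; small, since the classes barely overlap
```

Its error rate could then be appended to the records table for a direct comparison with the tree, KNN, and glm results.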